The goal of emotion detection is to find and recognise emotions in text, speech, gestures, facial expressions, and other signals. This paper proposes a multimodal emotion recognition system based on facial expressions, sentence-level text, and voice. Using public datasets, we examine feature extraction and classification of facial expression images. Tri-modal fusion then integrates the results from the three modalities to produce the final emotion. The proposed method has been validated on students in a classroom setting, where the detected emotions correlate with the students' performance. It categorizes students' expressions into seven emotions: happiness, surprise, sadness, fear, disgust, anger, and contempt. In comparison with the unimodal models, the proposed multimodal network design reaches up to 65% accuracy. The proposed method can also detect negative feelings such as boredom or loss of interest in the learning environment.
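As an illustration of the fusion step, the following is a minimal sketch of decision-level (late) fusion, in which each modality outputs a probability vector over the seven emotions and the vectors are combined by a weighted average. This is an assumption for illustration only: the abstract does not state whether fusion is performed at the decision level or the feature level, and the names `EMOTIONS` and `fuse` and the `weights` parameter are hypothetical.

```python
# Hypothetical label order; the seven classes are those named in the abstract.
EMOTIONS = ["happy", "surprise", "sad", "fear", "disgust", "anger", "contempt"]


def fuse(face, text, voice, weights=(1.0, 1.0, 1.0)):
    """Combine three per-modality probability vectors by weighted averaging.

    Assumption: each argument is a list of seven probabilities, ordered as
    in EMOTIONS. Returns the predicted label and the fused vector.
    """
    modalities = (face, text, voice)
    for probs in modalities:
        if len(probs) != len(EMOTIONS):
            raise ValueError("each modality must provide one score per emotion")

    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Weighted mean of the three modality scores, computed per emotion class.
    fused = [
        sum(w * probs[i] for w, probs in zip(weights, modalities)) / total
        for i in range(len(EMOTIONS))
    ]

    # The final emotion is the class with the highest fused score.
    best = max(range(len(EMOTIONS)), key=fused.__getitem__)
    return EMOTIONS[best], fused
```

With equal weights, a class favoured by two of the three modalities wins over a class favoured by only one; unequal weights would let a more reliable modality dominate.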